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Abstract: In many atmospheric and earth sciences, it is of interest to identify 
dominant spatial patterns of variation based on data observed at p locations and 
n time points with the possibility that p > n. While principal component anal¬ 
ysis (PCA) is commonly applied to find the dominant patterns, the eigenimages 
produced from PCA may exhibit patterns that are too noisy to be physically 
meaningful when p is large relative to n. To obtain more precise estimates of 
eigenimages, we propose a regularization approach incorporating smoothness 
and sparseness of eigenimages, while accounting for their orthogonality. Our 
method allows data taken at irregularly spaced or sparse locations. In addition, 
the resulting optimization problem can be solved using the alternating direction 
method of multipliers, which is easy to implement, and applicable to a large spa¬ 
tial dataset. Furthermore, the estimated eigenfunctions provide a natural basis 
for representing the underlying spatial process in a spatial random-effects model, 
from which spatial covariance function estimation and spatial prediction can be 
efficiently performed using a regularized fixed-rank kriging method. Finally, the 

effectiveness of the proposed method is demonstrated by several numerical ex¬ 
amples. 

Keywords: Alternating direction method of multipliers, empirical orthogonal 
functions, fixed rank kriging, Lasso, non-stationary spatial covariance estima¬ 
tion, orthogonal constraint, smoothing splines. 
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1. Introduction 

In many atmospheric and earth sciences, it is of interest to identify dominant spatial 
patterns of variation based on data observed at p locations with n repeated measure¬ 
ments, where p may be larger than n. The dominant patterns are the eigenimages 
of the underlying (nonstationary) spatial covariance function with large eigenvalues. 
A commonly used approach for estimating the eigenimages is the principal compo¬ 
nent analysis (PCA), also known as the empirical orthogonal function analysis in 
atmospheric science. However, when p is large relative to n, the leading eigenimages 
produced from PCA may be noisy with high estimation variability, or exhibit some 
bizarre patterns that are not physically meaningful. To enhance the interpretabil- 
ity, a few approaches, such as rotation of components according to some criteria 
(see Richman (1986), Jolliffe (1987), Richman (1987)), have been proposed to form 
more desirable patterns. However, how to obtain a desired rotation in practice is not 
completely clear. Some discussion can be found in Hannachi, Jolliffe and Stephenson 
(2007). 

Another approach to aid interpretation is to seek sparse or spatially localized pat¬ 
terns, which can be done by imposing an L i constraint or adding an L\ penalty 
to an original PCA optimization formulation (Jolliffe, Uddin and Vines (2002), Zou, 
Hastie and Tibshirani (2006), Shen and Huang (2008), d’Aspremont, Bach and Ghaoui 
(2008), and Lu and Zhang (2012)). However, this approach may produce a pattern 
with isolated zero and nonzero components, and except Jolliffe, Uddin and Vines 
(2002) and Lu and Zhang (2012), the PC estimates produced from these approaches 
may not have orthogonal PC loadings. 

For continuous spatial domains, the problem becomes even more challenging. In¬ 
stead of looking for eigenimages on a lattice, we need to find eigenfunctions by es¬ 
sentially solving an infinite dimensional problem based on data observed at possibly 
sparse and irregularly spaced locations. Although some approaches have been devel¬ 
oped using functional principal component analysis (see e.g., Ramsay and Silverman 


/Regularized, Principal Component Analysis for Spatial Data 3 

(2005), Yao, Muller and Wang (2005) and Huang, Shen and Buja (2008)), they typi¬ 
cally focus on one-dimensional processes, or require data observed at dense locations. 
In particular, these methods generally do not work well when data are observed at 
fixed but sparse locations. Reviews of PCA on spatial data can be found in Hannachi, 
Jolliffe and Stephenson (2007) and Demsar et al. (2013). 

In this research, we propose a regularization approach for estimating the dominant 
patterns, taking into account smoothness and localized features that are expected in 
real-world spatial processes. The proposed estimates are directly obtained by solving 
a minimization problem. We call our method SpatPCA , which not only gives effective 
estimates of dominant patterns, but also provides an ideal set of basis functions for 
estimating the underlying (nonstationary) spatial covariance function, even when data 
are irregularly or sparsely located in space. In addition, we develop a fast algorithm 
to solve the resulting optimization problem using the alternating direction method 
of multipliers (ADMM) (see Boyd et al. (2011)). An R package called SpatPCA is 
developed and available on the Comprehensive R Archive Network (CRAN). 

The rest of this paper is organized as follows. In Section 2, we introduce the pro¬ 
posed SpatPCA method, including dominant patterns estimation and spatial covari¬ 
ance function estimation. Our ADMM algorithm for computing the SpatPCA es¬ 
timates is provided in Section 3. Some simulation experiments that illustrate the 
superiority of SpatPCA and an application of SpatPCA to a global sea surface tem¬ 
perature dataset are presented in Section 4. 

2. The Proposed Method 

Consider a sequence of zero-mean L 2 -continuous spatial processes, {7/j(s);s e D}\ 
i = 1,... ,7i, defined on a spatial domain D C M d , which are mutually uncorrelated, 
and have a common spatial covariance function, C v (s,s*) = cov(r/j(s), p/s*)). We 
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consider a rank-A' spatial random-effects model for r/ 4 (•): 

K 

Vi(s) = (<Pi(s), • • • ,<p K (s))£i = ^2Ci k <Pk(s)-, scD, i = l,...,n, 

k= 1 

where (<pfc(.)} are unknown orthonormal basis functions, & = (£u, • • • jChc)' rv.; (0, A); 
z = 1, ...,n, are uncorrelated random variables, and A is an unknown symmetric 
nonnegative-definite matrix, denoted by A A 0. A similar model based on given 
was introduced by Cressie and Johannesson (2008) and in a Bayesian frame¬ 
work by Kang and Cressie (2011). 

Let A kk' be the (k,k ')~th entry of A. Then the spatial covariance function of 
is: 

K K 

C v (s,s*) = co v(rji(s),rii(s*)) = ^ ^ \ k k'<Pk(s)ip k /s*). (1) 

k= 1 k'= 1 

Note that A is not restricted to be a diagonal matrix. 

Let A = VA* V be the eigen-decomposition of A, where V consists of K orthonor¬ 
mal eigenvectors, and A* = diag(A*,..., A^) consists of eigenvalues with A* > • • • > 
X* K . Let = V’ii and 

(^(s),...,V7^(s)) = (y>i(s),..., ¥>*(«)) V; s CD. 

Then <p£(-)’s are also orthonormal, and £* k ~ (0, A£); i — 1,..., n, k — 1,..., K, are 
mutually uncorrelated. Therefore, we can rewrite ?/*(■) in terms of <yj£(-)’s: 

K 

m W = M»),-.,AW)C = EM 8 ); seD - p) 

k= 1 

The above expansion is known as the Karhunen-Loeve expansion of r/,(-) (Karhunen 
(1947); Loeve (1978)) with K nonzero eigenvalues, where </?£(■) is the fc-th eigenfunc¬ 
tion of C v (-, •) with X* k the corresponding eigenvalue. 

Suppose that we observe data Y* = (Y/si), ... ,Yi(s p )y with added white noise 
€j ~ (0, a 2 1 ) at p spatial locations, s 1; . .., s p E D, according to 


Y i = rj i + e i = + €*; i 


1 ,■■■,«, 


( 3 ) 
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where ip = (rj/si ),..., rj/Sp))', $ = (0 1; ..., <f> K ) is a px K matrix with the (j, k)- th 
entry tpi/sj), and e/s and £/s are uncorrelated. Our goal is to identify the first L < K 
dominant patterns, cpi(-),... with relatively large X\,... ,X* L . Additionally, we 

are interested in estimating C v (-, ■), which is essential for spatial prediction. 

Let Y = (Yi,..., Y n )' be the nxp data matrix. Throughout the paper, we assume 
that the mean of Y is known as zero. So the sample covariance matrix of Y is 
S = Y'Y/n. A popular approach for estimating {</?£(•)} is PCA, which estimates 
(^(sjJ,..., p* k (s p ))' by <fi k , the fc-th eigenvector of S, for k = Let = 

(0i,..., 4>k) be a px K matrix formed by the first K principal component loadings. 
Then $ solves the following constrained optimization problem: 

min || V — subject to ffdfh = Ik, 

& 

1 /2 

where $ = (0i,...,0x) and ||iW||j? = ( m tj) is ^ ie Frobenius norm of a 

matrix M. Unfortunately, $ tends to have high estimation variability when p is large 
(leading to excessive number of parameters), n is small, or a 2 is large. Consequently, 
the patterns of $ may be too noisy to be physically interpretable. In addition, for a 
continuous spatial domain D, we also need to estimate <£>£(s)’s for locations with no 
data observed (i.e., s {s 1; ..., s p }); see some discussion in Section 12.4 and 13.6 of 
Jolliffe (2002). 

2.1. Regularized Spatial PCA 


To prevent high estimation variability of PCA, we adopt a regularization approach 
by minimizing the following objective function: 

I< K p 

II Y - y**'||j + r, £ JW + t 2 J2 |w(*j)|, (4) 

k =1 k =1 j =1 


over (^i(-),..., <Pk{'), subject to <& / <E > = Ik and 0 / 1 5 , 0i > 0' 2 S0 2 > ■ ■ ■ > 4>' k S4>k, 
where 


J(p) 



ds , 
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is a roughness penalty, s = (ay ,...,Xd)', ^ > 0 is a smoothness parameter, and 
r 2 > 0 is a sparseness parameter. The objective function (4) consists of two penalty 
terms. The first one is designed to enhance smoothness of ipk{ m ) through the smoothing 
spline penalty while the second one is the Li Lasso penalty (Tibshirani (1996)), 

used to promote sparse patterns by shrinking some PC loadings to zero. While the 
L] penalty alone may lead to isolated zero and nonzero components with no global 
feature, when it is paired with the smoothness penalty, local sparsity translates into 
global sparsity, resulting in connected zero and nonzero patterns. Hence the two 
penalty terms together lead to desired patterns that are not only smooth but also 
localized. Specifically, when ly is larger, <£ fe (-)’s tend to be smoother and vice versa. 
When t 2 is larger, ^.(-)’s are forced to be zero at some sgD. On the other hand, when 
both t i and r 2 are close to zero, the estimates are close to those obtained from PCA. 
By suitably choosing 7y and r 2 , we can obtain a good compromise among goodness of 
fit, smoothness of the eigenfunctions, and sparseness of the eigenfunctions, leading to 
more interpretable results. Note that due to computational difficulty, the orthogonal 
constraint, is not considered by many PCA regularization methods (e.g., Zou, Hastie 
and Tibshirani (2006), Shen and Huang (2008), Guo et al. (2010), Hong and Lian 
(2013)). 

Although J(ip) involves integration, it is well known from the theory of smoothing 
splines (Green and Silverman (1994)) that for each k = 1,..., Ii, <p>k(-) has to be a 
natural cubic spline when d = 1, and a thin-plate spline when d G {2,3} with nodes 
at {si,..., s p }. Specifically, 


p d 

^( s ) = ^2 a ig(\\ s - s *id + b o +^2 b J x j , 

*= 1 3 =1 


where s — (ay,..., xy)', 


g(r) 


167T 


r 2 logr; 


r(d/2 - 2) , d _ 

lQ n d /2 


if d 
if d 


2 , 

1,3, 


( 5 ) 
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and the coefficients a — (ai,..., a p )' and b = (b 0 , bi ,..., b d )' satisfy 


G 

E 


a 


4>k 

E t 

0 


b 


0 


Here G is a p x p matrix with the (i, j)-th element g(||sj — Sj||), and E is a p x (d +1) 
matrix with the i-th row (1, s'). Consequently, (pk{-) in (5) can be expressed in terms 
of 4>}~. Additionally, the roughness penalty can also be written as 

J{<Pk) = (6) 

with f2 a known p x p matrix determined only by s 1; ..., s p . The readers are referred 
to Green and Silverman (1994) for more details regarding smoothing splines. 

From (4) and (6), the proposed SpatPCA estimate of $ can be written as: 


K 


K 


$ti,t 2 = arg min \\Y - Y $$'111 + H V (p[fl(pk + r 2 Y] Y ]\(/> jk \, (7) 

±K k= 1 k =1 j=l 

subject to 4>[Sc /)i > <p' 2 S<p 2 > • • • > 4>' k S(/)k- The resulting estimates of ^i(-),..., </?#(•) 
can be directly computed from (5). When no confusion may arise, we shall simply 


write $ 


T1,T 2 


as <h. Note that the SpatPCA estimate of (7) reduces to a sparse PCA 


estimate of Zou, Hastie and Tibshirani (2006) if the orthogonal constraint is dropped 
and f2 — I (i.e., no spatial structure is considered). 

The tuning parameters T\ and r 2 are selected using M -fold cross-validation (CV). 
First, we partition {1,..., n} into M parts with as close to the same size as possible. 
Let be the sub-matrix of Y corresponding to the m-th part, for m = 1,..., M. 
For each part, we treat Y ^ as the validation data, and obtain the estimate 
of $ for (ti, r 2 ) e A based on the remaining data Y A m ) using the proposed method, 
where A C [0, 00 ) 2 is a candidate index set. The proposed CV criterion is given in 
terms of an average residual sum of squares: 

1 


M 


CV 1 (ri,r 2 ) = jL J2 || V (m| - 

m =1 


12 

If 5 


( 8 ) 


where Y^&i^(<§>1^)' is the projection of Y^A onto the column space of 
The final T\ and r 2 values are (fi,f 2 ) = arg min CVi(ri, r 2 ). 








/Regularized, Principal Component Analysis for Spatial Data 


2.2. Estimation of Spatial Covariance Function 

To estimate C^(-, •) in (1), we also need to estimate the spatial covariance parameters, 
a 2 and A. We apply the regularized least squares method of Tzeng and Huang (2015): 


(d 2 , A) = argmin ( JllS - $A$'- a 2 T|| 2 + 7 ||$A$ , ||*), (9) 

(o- 2 ,a):o- 2 >o, a^o 12 J 

where 7 > 0 is a tuning parameter, and HATH* = tr^AT'AT) 1 / 2 ) is the nuclear 

norm of AT. The first term of (9) corresponds to goodness of fit by noting that 

var(Yi) = $A$' + a 2 I. The second term of (9) is a convex penalty, shrinking the 

eigenvalues of $A$' to promote a low-rank structure and to avoid the eigenvalues 

being overestimated. By suitably choosing a tuning parameter 7 , we can control the 

bias, while reducing the estimation variability. This is particularly effective when K 

is large. 

Tzeng and Huang (2015) provides a closed-form solution for A, but requires an 
iterative procedure for solving <r 2 . We found that closed-form expressions for both a 2 
and A are available, and are shown in the following proposition with its proof given 
in the Appendix. 

Proposition 1. The solutions of (9) are given by 


A = V diag(A^,..., X* K )V', 

r 

Tl’-l'"' 

k =1 


a 


= < p-L 


- (tr(S)); 
P 


tr(S) - (dk - 7 )); if <h > 1 , 

if d\ < 7 , 


( 10 ) 

( 11 ) 


where TTliag(di,..., cIk)V ' is the eigen-decomposition of <&' S& with d± > ■ ■ ■ > dx, 

L 


L = max < L : d,L — 7 > 


p — L 


(tr(S) - - 7)), L = 1,..., if), (12) 


and XI = max(dfc — a 2 — 7 , 0 ); k = 1,..., K. 
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With A = (A kk') KxK given by (9) and <fk( s ) given by (5), the proposed estimate 
of C v (s, s*) is 

K K 

C v (s, s*) = ^2 ^ kk1 fik(.s)tp k '(s*). (13) 

k =1 k '=1 

Then the proposed estimate of (^(s),..., </?^(s)) is 


(£[(«),..., <p* K {s)) = ..., <p K (s))V ; s e D. 


We consider M-fold CV to select 7 . As in the previous section, we partition the 
data into M parts, Y^ 1 ), ..., Y^ M \ For m — 1,, M, we estimate var(W( - "A) by 
S( _m ) = 4»( _m )A^ _m ' ) (4» (_m )) / + m> I based on the remaining data by 

removing Y ^ from Y, where Ay~ m \ (<3^) ( ' m ' ) and are the estimates of A, 

a 2 and $ based on Y^~ m \ and for notational simplicity, their dependences on the 
selected ( 77 , 72 ) and K are suppressed. The proposed CV criterion is given by 

1 M 

CV 2 (A, 7 ) = — ll 5(m) - - (a 2 Y~ m ^I\\l , (14) 

m—1 

where S( m ' ) = (Y^/Y^ /n. Then the 7 selected by CV 2 based on K is 7 k = 
arg min CV 2 (K, 7 ). 

7>0 

The dimension of eigen-space K, corresponding to the maximum rank of ♦bAT*', 
could be selected by traditional approaches based on a given proportion of total varia¬ 
tion explained or the scree plot of the sample eigenvalues. However, these approaches 
tend to be more subjective and may not be effective for the covariance estimation pur¬ 
pose. We propose to select K using CV 2 of (14) by subsequently increase the value 
of K from K — 1,2,..., until no further reduction of the CV 2 value. Specifically, we 
select 


K = min {K : CV 2 (W, 7 *) < C V 2 (K + 1, 7 *+i), K = 1,2,...}. (15) 


3. Computation Algorithm 

Solving (7) is a challenging problem especially when both the orthogonal constraint 
and the L\ penalty are involved simultaneously. Consequently, many regularized PCA 
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approaches, such as sparse PCA (Zou, Hastie and Tibshirani, 2006), do not cope with 
the orthogonal constraint. We adopt the ADMM algorithm by decomposing the orig¬ 
inal constrained optimization problem into small subproblems that can be efficiently 
handled through an iterative procedure. This type of algorithm was developed early 
in Gabay and Mercier (1976), and was systematically studied by Boyd et al. (2011) 
more recently. 

First, the optimization problem of (7) is transferred into the following equivalent 
problem by adding an p x K parameter matrix Q: 

K K p 


min || Y-Y $$'||f + A ^ <P'k^k + A ^ ’ 

3>,QeRP xK 

k=1 k=1 j=l 


(16) 


subject to Q'Q = Ik , 4>[S<f)i > (f> 2 S<f >2 > • • • > 4>' k S<Pk, and a new constrain, 
= Q. Then the resulting constrained optimization problem of (16) is solved using 
the augmented Lagrangian method with its Lagrangian given by 

K K p 

l(*, q , n -ii y- yq&wi+ T] J 2 tktofa + 

fc=l k= 1 j =1 

+ tr(r'(#-Q)) + |||#-Q|||., 


subject to Q'Q = Ik and 4>[Scj) i > (j>' 2 S(t>2 > ■ ■ ■ > 4>' k S(/)k, where T is a p x K 
matrix of the Lagrange multipliers, and p > 0 is a penalty parameter to facilitate 
convergence. Note that the value of p does not affect the original optimization prob¬ 
lem. The ADMM algorithm iteratively updates one group of parameters at a time in 
both the primal and the dual spaces until convergence. Given the initial estimates, 
Q 1 ' 0 ' 1 and T (( h of Q and T, our ADMM algorithm consists of the following steps at 
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the 7-th iteration: 


<h (m) = argminL($,Q W ,r w ) 


K 

= arg min 

$ k =1 


- X(t> k 


J2r 2 \(j) j k\ 

3 =1 


Q (m) = arg min L($ (£+1) , Q. T^) = C/ W (V (£) )', 

Q.Q'Q=I k 


(17) 

(18) 
(19) 


r «+l) = r (t) + p ( # (f+l) _ Qd+P) j 

where X — (rxSl — Y'Y + p/p/2) 1 ! 2 , zf is the h column of X~ l (pQ^ - r ( d)/2, 
)' is the singular value decomposition of + p _ 1 r ( d, and p must 

be chosen large enough (e.g., twice the maximum eigenvalue of Y'Y) to ensure that 
X is positive-definite. Note that (17) is simply a Lasso problem (Tibshirani (1996)), 
which can be solved effectively using the coordinate descent algorithm (Friedman, 
Hastie and Tibshirani, 2010). 

Except (17), the ADMM steps given by (17)-(19) have closed-form expressions. 
In fact, we can make the algorithm involve only closed-form updates by further de¬ 
composing (17) into another ADMM step. Specifically, we can introduce another 
parameters r^’s to replace the last term of (16) and add the constraint, (pjk = r 3 k for 
j = 1,... ,p and k — 1,..., K, to form an equivalent problem: 


min \\Y -Y&& 

$,Q,R 


/ II 2 


K K p 

f -r n 4>'i^4>k + t 2 M > 

k =1 k =1 j =1 


subject to Q'Q = Ik, & = Q = R, and 4>[S<p i > (j>' 2 S(/> 2 > • • • > <p' K S<pK, where 
Tjk is the (j, fc)-th element of R. Then the corresponding augmented Lagrangian is 

K K p 

L(*, Q, R, r u r 2 ) = \\y - y*&\\ 2 f + + T * E E M 

k =1 k= 1 j =1 

+ tr(r;(# - Q)) + tr(r' 2 (# - fl)) 

+ f(ll*-QII?- +II*-«!!?■), 


subject to Q'Q = Ik and <f>[Sc/)i > <p' 2 S<f) 2 > • •• > c/)' k S<Pk, where r[ and T 2 
are p x K matrices of the Lagrange multipliers. Then the ADMM steps at the 7-th 
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iteration are given by 

$( m) = argminL($,QW,i? w ,rf ) ,rf) 

= + pi p - y'y)~ 1 {p{Q (t> + R^) — 1\ - r 2 }, ( 20 ) 

qC+D — argmin L(^ +1 \Q, R w .T[ e) ,T ( 2 e) ) = U {e) (V^)\ (21) 

Q:Q'Q=I k 

R [l+l) = argminL($ (m) ,Q (m) , 

R 

= -S T2 (p^ + V +Tf), ( 22 ) 

rf +1) = rf + P ($ (£+1) - q^) , ( 23 ) 

Y^ +l) = if } + p ($ (£+1) - R ( £+1 )) , (24) 


where R^°\ rf and rf arc initial estimates of R , r | and r 2 , respectively, (V^)' 

is the singular value decomposition of _|_ p~ l Y^\ and S T2 {-) is the element-wise 

soft-thresholding operator with a threshold r 2 (i.e., the (j, fc)-th element of S T2 (M) is 
sign(mjfc) max(|mjfc| — r 2 , 0) with rrijk the (j, k)-th element of M ). Similarly to (17), 
p must be chosen large enough to ensure that TiO + pl p — Y'Y in (20) is positive 
definite. 


4. Numerical Examples 


We conducted some simulation experiments in one-dimensional and two-dimensional 
spatial domains, and applied SpatPCA to a real-world dataset. We compared the 
proposed SpatPCA with three methods: (1) PCA (ti = r 2 = 0); (2) SpatPCA with 
the smoothness penalty only (r 2 = 0); (3) SpatPCA with the sparseness penalty only 
(ti = 0), based on the two loss functions. The first one measures the prediction ability 
in terms of an average squared prediction error: 


Loss($) = -V||^:-^|| 2 , 
n n ii 


( 25 ) 


2=1 


where $ is the true eigenvector matrix formed by the first K eigenvectors and 

X\ At 


'K 


At + d 2 


X*K + & 


V'&Y U 


& = Vdiag 
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is the empirical best linear unbiased predictor of with the estimated parameters 
plugged in. The second one concerns the goodness of covariance function estimation 
in terms of an average squared estimation error: 

Loss(C r? ) = — EE (C v (s u sj ) - C v (s h Sj )) 2 . (26) 

P i= 1 3 =1 

We applied the ADMM algorithm given by (20)-(24) to compute the SpatPCA 
estimates with p being ten times the maximum eigenvalue of Y'Y. The stopping 
criterion for the ADMM algorithm is 


— max(||^ +1 ) 

Vp 


<F w || f , ||$ (m) - R {e+1) \\ F , ||$ (£+1) - Q (£+1) || f ) < 1(T 4 . 


4-1. One-Dimensional Experiment 


In the hrst experiment, we generated data according to (3) with K = 2 , ^ ~ 
N(0, diag(Ai, A 2 )), e* ~ iV(0, J), n = 100, p = 50, Si, ..., S50 equally spaced in 
D = [—5,5], and 

Ms) = -exp(-(x? + ••• +x 2 d )), (27) 

Cl 

02 (s) = — X! ■ ■ -x d exp(-(xl A -bx^)), (28) 

c 2 

where s = (x±,... ,Xd)', C\ and c 2 are normalization constants such that ||0i || 2 = 
II 02 1 |2 = 1) an d d — 1 . We considered three pairs of (Ai,A 2 ) € {(9, 0 ), ( 1 , 0 ), (9,4)}, 
and applied the proposed SpatPCA with K 6 {1,2,5} and K selected from (15), 
resulting in 12 different combinations. For each combination, we considered 11 values 
of t 1 (including 0 , and the other 10 values from 1 to 10 3 equally spaced on the log 
scale) and 31 values of r 2 (including 0, and the other 30 values from 1 to 10 3 equally 
spaced on the log scale). But instead of performing a two-dimensional optimization 
by selecting among all possible pairs of (ti,t 2 ), we applied a more efficient two-step 
procedure involving only one-dimensional optimization. First, we selected among 11 
values of r\ by fixing r 2 = 0 using 5-fold CV of ( 8 ) with the initial estimate of 
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given by the first K eigenvectors of Y'Y — TyO as its columns. Note that this initial 
estimate is actually the true estimate $n,o when Y'Y — 77 V 0. Then we selected 
among 31 values of T2 with the selected 77 using 5-fold CV of (8). 

For covariance function estimation, we selected the tuning parameter 7 among 11 
values of 7 using 5-fold CV of (14), including 7 = 0 and the other 10 values from 1 
to d\ equally spaced on the log scale, where d\ is the largest eigenvalues of 

Figure 1 shows the estimates of and 0 2 ( - ) f° r the four methods based on three 
different combinations of eigenvalues. Each case contains two estimated functions 
based on two randomly generated datasets. As expected, the PCA estimates, which 
consider no spatial structure, are very noisy, particularly when the signal-to-noise 
ratio is small. Adding only the smoothness penalty (i.e., T2 = 0) makes the estimates 
considerably less noisy. But the resulting estimates show some obvious bias. On the 
other hand, adding only the sparseness penalty (i.e, 77 = 0) forces the eigenfunction 
estimates to be zeros at some locations. But the estimated patterns are still very 
noisy. Overall, our SpatPCA estimates reproduce the targets with little noise for 
all cases even when the signal-to-noise ratio is small, indicating the effectiveness of 
regularization. 

Figure 2 shows the covariance function estimates for the four methods based on 
a randomly generated dataset. The proposed SpatPCA can be seen to perform con¬ 
siderably better than the other methods for all cases by being able to reconstruct 
the underlying nonstationary spatial covariance functions without having noticeable 
visual artifacts. 

The performance of the four methods in terms of the loss functions (25) and (26) 
is shown in Figures 3 and 4, respectively, based on 50 simulation replicates. Once 
again, SpatPCA outperforms all the other methods in all cases. For (Ai, A2) = (9, 0), 
the average computation time for SpatPCA (including selection of Ai and A2 using 
5-fold CV) with K = 1,2,5 are 0.020, 0.065 and 0.264 seconds, respectively, which 
are larger than 0.002 seconds required for PCA. The results were conducted using our 
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Fig 1. Estimates of and 0 2 (-) obtained from various methods based on three different combina¬ 
tions of eigenvalues. Each panel consists of two estimates (in two different line types) corresponding 
to two randomly generated datasets, where the solid grey lines are the true eigenfunctions. 
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Fig 2. True covariance functions and their estimates obtained from various methods based on three 
different combinations of eigenvalues. 
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Fig 3. Boxplots of average squared prediction errors of (25) for various methods in the one¬ 
dimensional simulation experiment of Section f.l based on 50 simulation replicates. 


R package “SpatPCA” implemented on an iMac PC equipped with a 3.2GHz Intel 
Core i5 CPU and a 64GB RAM. 


4-2. Two-Dimensional Experiment I 

We considered a two-dimensional experiment by generating data according to (3) with 
K = 2, £ i ~ A/”(0, diag(Ai, A 2 )), e r ~ N(0,I), n = 500, si,...,s p regularly spaced 
at p = 20 2 locations in D = [—5, 5] 2 . Here 0i(-) and 02( - ) are given by (27) and (28) 
with d = 2 (see the images in the first column of Figure 5). 
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Fig 4. Boxplots of average squared estimation errors of (26) for various methods in the one¬ 
dimensional simulation experiment of Section ).l based on 50 simulation replicates. 
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Estimates of <^i(-) based on (Ai, A 2 ) = (9, 0) and K = 1 
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Fig 5. Estimates of f i(-) and $ 2 /) obtained from various methods based on three different combi¬ 
nations of eigenvalues. 
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We considered three pairs of (Ai, A 2 ) E {(9,0), (1,0), (9,4)}, and applied the pro¬ 
posed SpatPCA with K E {1, 2, 5} and K selected from (15). As in the one-dimensional 
experiment, we used 5-fold CV of (8) and a two-step procedure to select among the 
same 11 values of T\ and 31 values of r 2 . Similarly, we used 5-fold CV of (14) to select 
among the same 11 values of 7 for covariance function estimation. 

Figure 5 shows the estimates of 0i(-) and 0 2 (-) obtained from the four methods 
for various cases based on a randomly generated dataset. The performance of the 
four methods in terms of the loss functions (25) and (26) is summarized in Fig¬ 
ure 6 and Figure 7, respectively, based on 50 simulation replicates. Similarly to the 
one-dimensional examples, SpatPCA performs significantly better than all the other 
methods in all cases. For (Ai, A 2 ) = (9, 0), the average computation time for SpatPCA 
(including selection of Ai and A 2 using 5-fold CV) with K = 1,2,5 are 3.105, 4.242 
and 16.160 seconds, respectively, using the R package “SpatPCA” implemented on 
an iMac PC with a 3.2GHz Intel Core i5 CPU and a 64GB RAM. While SpatPCA 
is slower than PCA (requiring only 0.267 seconds), it is reasonably fast and provides 
much improved results. 

4-3. An Application to a Sea Surface Temperature Dataset 

Since the proposed SpatPCA works better when both smoothness and sparseness 
penalties are involved according to the simulation experiments in Sections 4.1 and 
4.2, we applied the proposed SpatPCA with both penalty terms to a sea surface 
temperature (SST) dataset observed over a region in the Indian Ocean, and only 
compared it with PCA. The data are monthly averages of SST obtained from the 
Met Office Marine Data Bank (available at http: //www.metoff ice .gov.uk/hadobs/ 
hadisst/) on 1 degree latitude by 1 degree longitude (1° x 1°) equiangular grid cells 
from January 2001 to December 2010 in the region between latitudes 20° N and 20°S' 
and between longitudes 39 °E and 120 °E. Out of 40 x 81 = 3, 240 grid cells, there 
are 460 cells on the land where no data are available. Hence the data we used are 
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Fig 6. Boxplots of average squared prediction errors of (25) for various methods in the two- 
dimensional simulation experiment of Section f.2 based on 50 simulation replicates. 
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Fig 7. Boxplots of average squared estimation errors of (26) for various methods in the two- 
dimensional simulation experiment of Section f.2 based on 50 simulation replicates. 
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from PCA 4>i(’) from SpatPCA 



Fig 8. Estimated eigenimages obtained form PCA and SpatPCA over a region in the Indian Ocean, 
where the gray regions correspond to the land. 

observed at p — 2, 780 cells and 120 time points. We first detrended the SST data by 
subtracting the SST for a given cell and a given month by the average SST for that 
cell and that month over the whole period. We decomposed the data into two parts 
with one part consisting of 60 time points of {1,3,..., 119} for training data, and the 
other part, consisting of 60 time points of even numbers, for validation purpose. 

We applied SpatPCA on the training data with K selected by K of (15). Similar 
to the two-step method described in Section 4.1, we selected among 11 values of T\ 
(including 0, and the other 10 values from 10 3 to 10 s equally spaced on the log scale) 
and 31 values of r 2 (including 0, and the other 30 values from 1 to 10 3 equally spaced 
on the log scale) using 5-fold CV of (8). For both PCA and SpatPCA, we applied 
5-fold CV of (14) to select among 11 values of 7 (including 0 and other 10 values 
from (A/10 3 to d\ equally spaced on the log scale), where d\ is the largest eigenvalue 
of 

The first two dominant patterns estimated from PCA and SpatPCA are shown 
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Fig 9. Mean squared errors of covariance matrix estimation with respect to K for PC A and SpatPCA. 

in Figure 8. Both methods identify similar patterns with the ones estimated from 
SpatPCA being a bit smoother than those estimated from PCA. The first pattern 
is a basin-wide mode and the second one corresponds to the east-west dipole mode 
(Deser et al. (2009)). 

We used the validation data to evaluate the performance between PCA and Spat¬ 
PCA in terms of the mean squared error (MSE), ||S — S v \\ 2 F /p 2 , where X is a generic 
estimate of var(W) based on the training data, and S v is the sample covariance matrix 
based on the validation data. The resulting MSE for PCA is 1.05 x 10 -4 , which is 
slightly larger than 1.02 x 10 -4 for SpatPCA. Figure 9 shows the MSEs with respect 
to various K values for both PCA and SpatPCA. The results indicate that SpatPCA 
is not sensitive to the choice of K as long as K is sufficiently large. Our choice of 
K = 6 for SpatPCA based on (15) appears to be effective, and is smaller than K = 15 
for PCA. 

4-4• Two-Dimensional Experiment II 

To reflect a real-world situation, we generated data by mimicking the SST dataset 
analyzed in the previous subsection, except we applied a larger noise variance. Specif¬ 
ically, we generated data according to (3) with K = 2, & ~ Af(0, diag(Ai, A 2 )), 
€j ~ N(0,I), n = 60, and at the same 2,780 locations from the SST dataset. Here 
0i(-) and 02(-) are given by <pi(-) and (see Figure 8) and (Ai, A2) = (91.3,16.1) 
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</>i(-) from PCA ^i(’) from SpatPCA 
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Fig 10. Estimates of (fii(-) and obtained from PCA and SpatPCA in the two-dimensional 

experiment of Section f.f based on a randomly simulated dataset, where the areas in gray are the 
land. 

estimated by SpatPCA in the previous subsection. 

We applied the 5-fold CV of (14) and (15) to select the tuning parameters (ri,r 2 ) 
and K in the same way as in the previous subsection. Figure 10 shows the estimates 
of </>i(-) and 02(0 for PCA and SpatPCA based on a randomly generated dataset. 
Because we consider a larger noise variance than those in the previous subsection, 
the first two patterns estimated from PCA turn out to be very noisy. In contrast, 
SpatPCA can still reconstruct the first two patterns very well with little noise. The 
results in terms of the loss functions of (25) and (26) are summarized in Figure 11. 
Once again, SpatPCA outperforms PCA by a large margin. 
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Appendix 

Proof of Proposition 1. First, we prove (10). From Corollary 1 of Tzeng and Huang 
(2015), the minimizer of h( A, cr 2 ) given a 2 is 


A(a 2 ) = V diag((di - a 2 - y) + ,..., (d K -a 2 - 7 )+) V'. (29) 


Hence (10) is obtained. 

Next, we prove (11). Rewrite the objective function of (9) as: 



(30) 


From (29) and (30), we have 


h(A(a 2 ), a 2 ) 



k= 1 
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Minimizing h(A(a 2 ), a 2 ), we obtain 

I< 


in j pa A - 2 cx 2 tr (S) - V] (d k - a 2 - 7) + )• 

1 ^ fc=i ' 


a = arg mm 

cr 2 >0 


Clearly, if d\ < 7 , then a 2 = -tr (S). We remain to consider d\ > 7 . Let 

V 


27 


(31) 


From (31), b 2 = 


d~ L * - 7 > 


L* = max [L : dp — 7 > b 2 , L — 1,..., iC}. 


— 7 ) J. It suffices to show that L* = L. Since 


p — L* 


k= 1 


P 


L* V 


tr(5) - — 7 ) ), by the dehnition of L, we have L > L *, 


k =1 


implying 4 > 4*- Suppose that L > L*. It immediately follows from the dehnition 
of L* that dp — 7 < b 2 < dp, — 7 , which contradicts to dp > dp*. Therefore, L = L*. 
This completes the proof. □ 
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